Descriptive statistics

summarise variable(s)

The main function for calculating summaries of variables is summarise. Descriptive functions such as mean, median and sum take a vector and return a single value. summarise takes a tibble together with one or more such descriptive specifications and returns a tibble with a single row.

For example, let’s say we want to know the mean height and weight of all individuals in the pulse dataset:

pulse %>% summarise(meanHeight=mean(height), meanWeight=mean(weight))
# A tibble: 1 × 2
  meanHeight meanWeight
       <dbl>      <dbl>
1       172.       66.3

The result is a single row with two variables, meanHeight and meanWeight, holding the mean values over all observations.
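To see the mechanics on data small enough to check by hand, here is a minimal sketch. The tibble toy and its values are invented for illustration and are not part of the pulse dataset.

```r
library(dplyr)

# Hypothetical stand-in for the pulse data (values are invented)
toy <- tibble(
  height = c(160, 180, 170, 170),
  weight = c(55, 80, 70, 65)
)

# Each descriptive function collapses a whole column to one value,
# so summarise returns a tibble with exactly one row
res <- toy %>%
  summarise(meanHeight = mean(height), meanWeight = mean(weight))

res$meanHeight  # 170, same as mean(toy$height)
res$meanWeight  # 67.5, same as mean(toy$weight)
```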

We can also summarise a variable’s range, e.g. for age:

pulse %>% summarise(minAge = min(age), maxAge=max(age)) # <=> range(pulse$age)
# A tibble: 1 × 2
  minAge maxAge
   <dbl>  <dbl>
1     18     45
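The comment in the call above notes the correspondence with range. A small sketch of that correspondence, assuming a made-up tibble toy with an invented age column:

```r
library(dplyr)

# Hypothetical ages, invented for this example
toy <- tibble(age = c(23, 18, 45, 30))

# Minimum and maximum as two named columns of a one-row tibble
res <- toy %>% summarise(minAge = min(age), maxAge = max(age))

# Base R returns the same two numbers as a vector of length 2
rng <- range(toy$age)

res$minAge == rng[1]  # TRUE
res$maxAge == rng[2]  # TRUE
```

The summarise version keeps the result in a tibble with named columns, which may be easier to combine with other summaries.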

n() is a convenient function that returns the total number of rows in the summarise context:

pulse %>% summarise( count = n(), meanHeight = mean( height ) )
# A tibble: 1 × 2
  count meanHeight
  <int>      <dbl>
1   110       172.
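As a rough check, the value of n() in an ungrouped summarise should match nrow of the input. The sketch below assumes a made-up three-row tibble toy:

```r
library(dplyr)

# Hypothetical heights, invented for this example
toy <- tibble(height = c(160, 180, 170))

# n() takes no arguments and is only meant to be called inside
# dplyr verbs such as summarise
res <- toy %>% summarise(count = n(), meanHeight = mean(height))

res$count == nrow(toy)  # TRUE, both are 3
```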

count : frequency tables

With the count function we can count the frequency of values in a categorical variable:

pulse %>% count(gender)   # frequency of male/female
# A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 female    51
2 male      59
pulse %>% count(smokes)   # frequency of smoking habit 
# A tibble: 2 × 2
  smokes     n
  <chr>  <int>
1 no        99
2 yes       11
pulse %>% count(exercise) # frequency of exercise habit 
# A tibble: 3 × 2
  exercise     n
  <chr>    <int>
1 high        14
2 low         37
3 moderate    59

The result enumerates the distinct values of the variable in the first column and their frequency in a new column n.
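One way to think about count is as a shortcut for grouping by the variable and then summarising with n(). The sketch below illustrates this on a made-up tibble toy; group_by is a dplyr function that is not introduced in this section, and the data are invented.

```r
library(dplyr)

# Hypothetical exercise habits, invented for this example
toy <- tibble(exercise = c("low", "high", "low", "moderate", "low"))

# Frequency table with count
freq_count <- toy %>% count(exercise)

# Roughly the same table built from group_by + summarise + n()
freq_manual <- toy %>%
  group_by(exercise) %>%
  summarise(n = n())

freq_count$exercise  # "high" "low" "moderate"
freq_count$n         # 1 3 1
```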

Multiple variables are allowed; the result is then the count of each combination of values that occurs in the data, also known as a contingency table or cross table:

pulse %>% count(gender, exercise)
# A tibble: 6 × 3
  gender exercise     n
  <chr>  <chr>    <int>
1 female high         3
2 female low         20
3 female moderate    28
4 male   high        11
5 male   low         17
6 male   moderate    31
pulse %>% count(year, gender)
# A tibble: 10 × 3
    year gender     n
   <dbl> <chr>  <int>
 1  1993 female    12
 2  1993 male      14
 3  1995 female    11
 4  1995 male      11
 5  1996 female    10
 6  1996 male      11
 7  1997 female     8
 8  1997 male      15
 9  1998 female    10
10  1998 male       8
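Note that for character variables count lists only the combinations that actually occur. A minimal sketch, assuming a made-up tibble toy in which no male reports high exercise:

```r
library(dplyr)

# Hypothetical data, invented for this example
toy <- tibble(
  gender   = c("female", "female", "male"),
  exercise = c("low", "high", "low")
)

tab <- toy %>% count(gender, exercise)

nrow(tab)                # 3: male/high never occurs, so it gets no row
sum(tab$n) == nrow(toy)  # TRUE: the frequencies add up to the row count
```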

distinct values in variables

To identify distinct values in a variable or a group of variables we use the function distinct:

pulse %>% distinct(year)
# A tibble: 5 × 1
   year
  <dbl>
1  1993
2  1995
3  1996
4  1997
5  1998
pulse %>% distinct(exercise)
# A tibble: 3 × 1
  exercise
  <chr>   
1 moderate
2 high    
3 low     
pulse %>% distinct(ran) 
# A tibble: 2 × 1
  ran  
  <chr>
1 sat  
2 ran  

Again, multiple variables are allowed. To identify distinct combinations of gender and exercise:

pulse %>% distinct(gender, exercise)
# A tibble: 6 × 2
  gender exercise
  <chr>  <chr>   
1 female moderate
2 female high    
3 male   high    
4 female low     
5 male   low     
6 male   moderate

‘distinct’ produces the same variable combinations as the ‘count’ function (possibly in a different row order), but without the frequency column ‘n’.
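This relationship can be sketched on a made-up tibble toy (invented values). Because distinct keeps the combinations in order of first appearance while count returns them sorted, the distinct result is sorted with arrange (described further down) before comparing.

```r
library(dplyr)

# Hypothetical data, invented for this example
toy <- tibble(
  gender   = c("male", "female", "male", "female"),
  exercise = c("low", "high", "low", "low")
)

# Distinct combinations, sorted so the row order matches count
from_distinct <- toy %>%
  distinct(gender, exercise) %>%
  arrange(gender, exercise)

# Same combinations plus the frequency column n
from_count <- toy %>% count(gender, exercise)

identical(from_distinct$gender, from_count$gender)      # TRUE
identical(from_distinct$exercise, from_count$exercise)  # TRUE
from_count$n                                            # 1 1 2
```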

You may also use distinct to check whether a variable has a unique value for each observation. For example, let’s check whether all individuals in the pulse dataset have different names, that is, whether each observation is uniquely identifiable by the variable name:

pulse %>% nrow()                    # total number of rows 
[1] 110
pulse %>% distinct(name) %>% nrow() # count the number of distinct names
[1] 106

There are 106 distinct names but 110 observations in total in the pulse dataset. This can only mean that some individuals in the pulse dataset share a name:

nrow(pulse) == nrow( pulse %>% distinct(name)) # is 'name' unique for all observations?
[1] FALSE
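The check above tells us that names are shared but not which ones. One possible follow-up, sketched here on a made-up tibble toy, is to combine count with filter to list the values that occur more than once. filter is a dplyr function not introduced in this section, and the ids and names below are invented.

```r
library(dplyr)

# Hypothetical data, invented for this example
toy <- tibble(
  id   = c("A", "B", "C", "D"),
  name = c("Ann", "Bobby", "Bobby", "Cas")
)

n_total <- nrow(toy)                          # 4 observations
n_names <- toy %>% distinct(name) %>% nrow()  # 3 distinct names

# Names that occur more than once
dups <- toy %>% count(name) %>% filter(n > 1)

dups$name  # "Bobby"
dups$n     # 2
```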

arrange

You may sort rows according to one or more variables with the function arrange.

Try sorting the pulse dataset by name:

pulse %>%  arrange(name) # sorts the rows by name in dictionary order 
# A tibble: 110 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1996_C Adel…    157     41    20 female no     no      modera… ran       70     95
 2 1996_P Adri…    180    102    20 male   no     yes     modera… sat       76     72
 3 1997_O Albe…    194    110    25 male   no     no      modera… sat       75     75
 4 1993_V Arle…    140     50    34 female no     no      low     ran       70     98
 5 1998_O Bett…    161     43    19 female no     no      low     sat       90     89
 6 1995_F Bobby    180     85    19 male   yes    yes     modera… ran       68    125
 7 1995_L Bobby    169     68    19 male   no     no      modera… sat       58     58
 8 1993_A Bonn…    173     57    18 female no     yes     modera… sat       86     88
 9 1996_F Bran…    171     67    18 female no     yes     low     sat       76     74
10 1996_K Brid…    160     49    19 female no     no      low     sat       80     72
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

or by height:

pulse %>%  arrange(height) # numerical order
# A tibble: 110 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1998_J Raul      68     63    19 male   no     no      modera… ran       88    136
 2 1998_N Lizz…     93     27    19 female no     no      low     sat      119    120
 3 1993_V Arle…    140     50    34 female no     no      low     ran       70     98
 4 1997_A Katr…    151     42    22 female no     no      low     ran       85    130
 5 1993_T Maura    155     50    19 female no     no      modera… sat       78     79
 6 1995_N Tisha    155     49    18 female no     yes     modera… sat      104     92
 7 1998_G Ursu…    155     55    20 female no     yes     high    sat       82     87
 8 1996_C Adel…    157     41    20 female no     no      modera… ran       70     95
 9 1996_J Pene…    158     51    18 female no     no      modera… ran       68     84
10 1995_G Laur…    160     57    19 female no     no      modera… ran       75    130
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

By default the data is sorted in ascending order; to sort in descending order, use the desc function:

pulse %>%  arrange(desc(name))
# A tibble: 110 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1997_C Will…    190   82      19 male   no     no      modera… sat       76     73
 2 1997_F Wesl…    172   53      20 male   no     no      low     ran       72    136
 3 1998_G Ursu…    155   55      20 female no     yes     high    sat       82     87
 4 1993_X Tyro…    182   75      26 male   yes    yes     modera… sat       80     76
 5 1993_J Troy     168   60      23 male   no     yes     modera… ran       88    150
 6 1993_D Trav…    195   84      18 male   no     yes     high    sat       71     73
 7 1996_B Trav…    167   70      22 male   yes    yes     low     sat       92     84
 8 1995_N Tisha    155   49      18 female no     yes     modera… sat      104     92
 9 1997_I Tim      170   58.5    20 male   no     no      low     sat       80     82
10 1996_M Tayl…    180   77      18 female no     no      modera… ran       47    136
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise
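A compact sketch of desc on a made-up tibble toy (invented names and heights), showing that it works for numeric as well as character variables:

```r
library(dplyr)

# Hypothetical data, invented for this example
toy <- tibble(
  name   = c("Ann", "Cas", "Bob"),
  height = c(160, 170, 180)
)

# Tallest first
by_height_desc <- toy %>% arrange(desc(height))
by_height_desc$name  # "Bob" "Cas" "Ann"

# Reverse alphabetical order
by_name_desc <- toy %>% arrange(desc(name))
by_name_desc$name    # "Cas" "Bob" "Ann"
```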

You may also arrange by multiple variables:

pulse %>%  arrange(height,weight)
# A tibble: 110 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1998_J Raul      68     63    19 male   no     no      modera… ran       88    136
 2 1998_N Lizz…     93     27    19 female no     no      low     sat      119    120
 3 1993_V Arle…    140     50    34 female no     no      low     ran       70     98
 4 1997_A Katr…    151     42    22 female no     no      low     ran       85    130
 5 1995_N Tisha    155     49    18 female no     yes     modera… sat      104     92
 6 1993_T Maura    155     50    19 female no     no      modera… sat       78     79
 7 1998_G Ursu…    155     55    20 female no     yes     high    sat       82     87
 8 1996_C Adel…    157     41    20 female no     no      modera… ran       70     95
 9 1996_J Pene…    158     51    18 female no     no      modera… ran       68     84
10 1996_K Brid…    160     49    19 female no     no      low     sat       80     72
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

Here the data is first ordered by height; rows with equal height are then ordered by weight.
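The tie-breaking role of the second variable is easier to see on a few rows. The sketch below assumes a made-up tibble toy with three individuals of equal height, and also shows that desc can be applied to one variable only.

```r
library(dplyr)

# Hypothetical data, invented for this example
toy <- tibble(
  name   = c("Ann", "Bob", "Cas", "Dee"),
  height = c(155, 155, 140, 155),
  weight = c(55, 49, 50, 50)
)

# Ascending height; ties in height are resolved by ascending weight
sorted <- toy %>% arrange(height, weight)
sorted$name  # "Cas" "Bob" "Dee" "Ann"

# Ascending height; ties in height are resolved by descending weight
sorted_mixed <- toy %>% arrange(height, desc(weight))
sorted_mixed$name  # "Cas" "Ann" "Dee" "Bob"
```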



Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC